π Load required
libraries
library(ggplot2)
library(htmlwidgets)
library(igraph)
library(monocle3)
library(plotly)
set.seed(12345)
𧬠Constructing
single-cell trajectories
During development, in response to stimuli, and throughout life,
cells transition from one functional βstateβ to another. Cells in
different states express different sets of genes, producing a dynamic
repetoire of proteins and metabolites that carry out their work. As
cells move between states, they undergo a process of transcriptional
re-configuration, with some genes being silenced and others newly
activated. These transient states are often hard to characterize because
purifying cells in between more stable endpoint states can be difficult
or impossible. Single-cell RNA-Seq can enable you to see these states
without the need for purification. However, to do so, we must determine
where each cell is in the range of possible states. For a comparison of
different trajectory inference methods, see Saelens et al.Β (2019) (Saelens et al. 2019).
Monocle 3 (Cao et al. 2019) is the
latest version of the Monocle toolkit. Monocle introduced the strategy
of using RNA-Seq for single-cell trajectory analysis (Trapnell et al. 2014). Rather than purifying
cells into discrete states experimentally, Monocle uses an algorithm to
learn the sequence of gene expression changes each cell must go through
as part of a dynamic biological process. Once it has learned the overall
βtrajectoryβ of gene expression changes, Monocle can place each cell at
its proper position in the trajectory. You can then use Monocleβs
differential analysis toolkit to find genes regulated over the course of
the trajectory, as described in the section Finding genes that
change as a function of pseudotime. If there are multiple outcomes
for the process, Monocle will reconstruct a βbranchedβ trajectory. These
branches correspond to cellular βdecisionsβ, and Monocle provides
powerful tools for identifying the genes affected by them and involved
in making them. You can see how to analyze branches in the section
Analyzing branches in single-cell trajectories.
The workflow for reconstructing trajectories is very similar to the
workflow for clustering, but it has a few additional steps. To
illustrate the workflow, we will use another C. elegans data
set, this one from Packer & Zhu et al. (Packer et al. 2019). Their study includes a
time series analysis of whole developing embyros. We will examine a
small subset of the data which includes most of the neurons.
π₯ Loading the
data
We will load the data as we did with the L2 data. This dataset
contains an expression matrix, cell metadata, and gene annotations.
expression_matrix <- readRDS(url("https://depts.washington.edu:/trapnell-lab/software/monocle3/celegans/data/packer_embryo_expression.rds"))
cell_metadata <- readRDS(url("https://depts.washington.edu:/trapnell-lab/software/monocle3/celegans/data/packer_embryo_colData.rds"))
gene_annotation <- readRDS(url("https://depts.washington.edu:/trapnell-lab/software/monocle3/celegans/data/packer_embryo_rowData.rds"))
cds <- new_cell_data_set(expression_matrix,
cell_metadata = cell_metadata,
gene_metadata = gene_annotation)
βοΈ Pre-process the
data
Pre-processing works exactly as in clustering analysis. This time, we
will use a different strategy for batch correction, which includes what
Packer & Zhu et al did in their original analysis.
Note: Your data will not have the loading batch
information demonstrated here, you will correct batch using your own
batch information.
cds <- preprocess_cds(cds, num_dim = 50)
cds <- align_cds(cds, alignment_group = "batch", residual_model_formula_str = "~ bg.300.loading + bg.400.loading + bg.500.1.loading + bg.500.2.loading + bg.r17.loading + bg.b01.loading + bg.b02.loading")
Note that in addition to using the alignment_group
argument to align_cds(), which aligns groups of cells
(i.e.Β batches), we are also using
residual_model_formula_str. This argument is for
subtracting continuous effects. You can use this to control for things
like the fraction of mitochondrial reads in each cell, which is
sometimes used as a QC metric for each cell. In this experiment (as in
many scRNA-seq experiments), some cells spontanously lyse, releasing
their mRNAs into the cell suspension immediately prior to loading into
the single-cell library prep. This βsupernatant RNAβ contaminates each
cellsβ transcriptome profile to a certain extent. Fortunately, it is
fairly straightforward to estimate the level of background contamination
in each batch of cells and subtract it, which is what Packer et al did
in the original study. Each of the columns bg.300.loading,
bg.400.loading, corresponds to a background signal that a
cell might be contaminated with. Passing these colums as terms in the
residual_model_formula_str tells align_cds()
to subtract these signals prior to dimensionality reduction, clustering,
and trajectory inference. Note that you can call
align_cds() with alignment_group,
residual_model_formula, or both.
π¨ Reduce
dimensionality and visualize the results
Next, we reduce the dimensionality of the data. However, unlike
clustering, which works well with both UMAP and t-SNE, here we strongly
urge you to use UMAP, the default method:
cds <- reduce_dimension(cds)
p1 <- plot_cells(cds, label_groups_by_cluster=FALSE, color_cells_by = "cell.type")
ggsave("figures/monocle3_trajectory_umap.pdf", p1, width=12, height=10)
ggsave("figures/monocle3_trajectory_umap.jpg", p1, width=12, height=10)
Figure 1: UMAP of C. elegans embryonic cells colored by
cell type
As you can see, despite the fact that we are only looking at a small
slice of this dataset, Monocle reconstructs a trajectory with numerous
branches. Overlaying the manual annotations on the UMAP reveals that
these branches are principally occupied by one cell type.
As with clustering analysis, you can use plot_cells() to
visualize how individual genes vary along the trajectory. Letβs look at
some genes with interesting patterns of expression in ciliated
neurons:
ciliated_genes <- c("che-1",
"hlh-17",
"nhr-6",
"dmd-6",
"ceh-36",
"ham-1")
p2 <- plot_cells(cds,
genes=ciliated_genes,
label_cell_groups=FALSE,
show_trajectory_graph=FALSE)
ggsave("figures/monocle3_trajectory_ciliated_genes.pdf", p2, width=12, height=10)
ggsave("figures/monocle3_trajectory_ciliated_genes.jpg", p2, width=12, height=10)
Figure 2: Expression of ciliated neuron genes along the
trajectory
We will learn how to identify the genes that are restricted to each
outcome of the trajectory later on in the section Finding genes that
change as a function of pseudotime.
π¬ Cluster your
cells
Although cells may continuously transition from one state to the next
with no discrete boundary between them, Monocle does not assume that all
cells in the dataset descend from a common transcriptional βancestorβ.
In many experiments, there might in fact be multiple distinct
trajectories. For example, in a tissue responding to an infection,
tissue resident immune cells and stromal cells will have very different
initial transcriptomes, and will respond to infection quite differently,
so they should be a part of the same trajectory.
Monocle is able to learn when cells should be placed in the same
trajectory as opposed to separate trajectories through its clustering
procedure. Recall that we run cluster_cells(), each cell is
assigned not only to a cluster but also to a partition. When you are
learning trajectories, each partition will eventually become a separate
trajectory. We run cluster_cells() as before.
cds <- cluster_cells(cds)
p3 <- plot_cells(cds, color_cells_by = "partition")
ggsave("figures/monocle3_trajectory_partitions.pdf", p3, width=12, height=10)
ggsave("figures/monocle3_trajectory_partitions.jpg", p3, width=12, height=10)
Figure 3: UMAP of C. elegans embryonic cells colored by
partition
πΈοΈ Learn the trajectory
graph
Next, we will fit a principal graph within each partition using the
learn_graph() function:
p4 <- plot_cells(cds,
color_cells_by = "cell.type",
label_groups_by_cluster=FALSE,
label_leaves=FALSE,
label_branch_points=FALSE)
ggsave("figures/monocle3_trajectory_graph.pdf", p4, width=12, height=10)
ggsave("figures/monocle3_trajectory_graph.jpg", p4, width=12, height=10)
Figure 4: UMAP of C. elegans embryonic cells with
learned trajectory graph
This graph will be used in many downstream steps, such as branch
analysis and differential expression.
β³ Order the cells in
pseudotime
Once weβve learned a graph, we are ready to order the cells according
to their progress through the developmental program. Monocle measures
this progress in pseudotime.
What is pseudotime?
Pseudotime is a measure of how much progress an individual cell has
made through a process such as cell differentiation.
In many biological processes, cells do not progress in perfect
synchrony. In single-cell expression studies of processes such as cell
differentiation, captured cells might be widely distributed in terms of
progress. That is, in a population of cells captured at exactly the same
time, some cells might be far along, while others might not yet even
have begun the process. This asynchrony creates major problems when you
want to understand the sequence of regulatory changes that occur as
cells transition from one state to the next. Tracking the expression
across cells captured at the same time produces a very compressed sense
of a geneβs kinetics, and the apparent variability of that geneβs
expression will be very high.
By ordering each cell according to its progress along a learned
trajectory, Monocle alleviates the problems that arise due to
asynchrony. Instead of tracking changes in expression as a function of
time, Monocle tracks changes as a function of progress along the
trajectory, which we term βpseudotimeβ. Pseudotime is an abstract unit
of progress: itβs simply the distance between a cell and the start of
the trajectory, measured along the shortest path. The trajectoryβs total
length is defined in terms of the total amount of transcriptional change
that a cell undergoes as it moves from the starting state to the end
state.
In order to place the cells in order, we need to tell Monocle where
the βbeginningβ of the biological process is. We do so by choosing
regions of the graph that we mark as βrootsβ of the trajectory. In time
series experiments, this can usually be accomplished by finding spots in
the UMAP space that are occupied by cells from early time points:
p5 <- plot_cells(cds,
color_cells_by = "embryo.time.bin",
label_cell_groups=FALSE,
label_leaves=TRUE,
label_branch_points=TRUE,
graph_label_size=1.5)
ggsave("figures/monocle3_trajectory_time_bin.pdf", p5, width=12, height=10)
ggsave("figures/monocle3_trajectory_time_bin.jpg", p5, width=12, height=10)
Figure 5: UMAP of C. elegans embryonic cells colored by
embryo time bin
The black lines show the structure of the graph. Note that the graph
is not fully connected: cells in different partitions are in distinct
components of the graph. The circles with numbers in them denote special
points within the graph. Each leaf, denoted by light gray circles,
corresponds to a different outcome (i.e.Β cell fate) of the trajectory.
Black circles indicate branch nodes, in which cells can travel to one of
several outcomes. You can control whether or not these are shown in the
plot with the label_leaves and
label_branch_points arguments to plot_cells.
Please note that numbers within the circles are provided for reference
purposes only.
Now that we have a sense of where the early cells fall, we can call
order_cells(), which will calculate where each cell falls
in pseudotime. In order to do so order_cells() needs you to
specify the root nodes of the trajectory graph. If you donβt provide
them as an argument, it will launch a graphical user interface for
selecting one or more root nodes.
# a helper function to identify the root principal points:
get_earliest_principal_node <- function(cds, time_bin="130-170"){
cell_ids <- which(colData(cds)[, "embryo.time.bin"] == time_bin)
closest_vertex <-
cds@principal_graph_aux[["UMAP"]]$pr_graph_cell_proj_closest_vertex
closest_vertex <- as.matrix(closest_vertex[colnames(cds), ])
root_pr_nodes <-
igraph::V(principal_graph(cds)[["UMAP"]])$name[as.numeric(names
(which.max(table(closest_vertex[cell_ids,]))))]
root_pr_nodes
}
cds <- order_cells(cds, root_pr_nodes=get_earliest_principal_node(cds))
In the above example, we just chose one location, but you could pick
as many as you want. Plotting the cells and coloring them by pseudotime
shows how they were ordered:
p6 <- plot_cells(cds,
color_cells_by = "pseudotime",
label_cell_groups=FALSE,
label_leaves=FALSE,
label_branch_points=FALSE,
graph_label_size=1.5)
ggsave("figures/monocle3_trajectory_pseudotime_interactive.pdf", p6, width=12, height=10)
ggsave("figures/monocle3_trajectory_pseudotime_interactive.jpg", p6, width=12, height=10)
Figure 6: UMAP of C. elegans embryonic cells colored by
pseudotime
Note that some of the cells are gray. This means they have infinite
pseudotime, because they were not reachable from the root nodes that
were picked. In general, any cell on a partition that lacks a root node
will be assigned an infinite pseudotime. In general, you should choose
at least one root per partition.
Itβs often desirable to specify the root of the trajectory
programmatically, rather than manually picking it. The function below
does so by first grouping the cells according to which trajectory graph
node they are nearest to. Then, it calculates what fraction of the cells
at each node come from the earliest time point. Then it picks the node
that is most heavily occupied by early cells and returns that as the
root.
# a helper function to identify the root principal points:
get_earliest_principal_node <- function(cds, time_bin="130-170"){
cell_ids <- which(colData(cds)[, "embryo.time.bin"] == time_bin)
closest_vertex <-
cds@principal_graph_aux[["UMAP"]]$pr_graph_cell_proj_closest_vertex
closest_vertex <- as.matrix(closest_vertex[colnames(cds), ])
root_pr_nodes <-
igraph::V(principal_graph(cds)[["UMAP"]])$name[as.numeric(names
(which.max(table(closest_vertex[cell_ids,]))))]
root_pr_nodes
}
cds <- order_cells(cds, root_pr_nodes=get_earliest_principal_node(cds))
Passing the programatically selected root node to
order_cells() via the root_pr_nodes argument
yields:
p7 <- plot_cells(cds,
color_cells_by = "pseudotime",
label_cell_groups=FALSE,
label_leaves=FALSE,
label_branch_points=FALSE,
graph_label_size=1.5)
ggsave("figures/monocle3_trajectory_pseudotime_programmatic.pdf", p7, width=12, height=10)
ggsave("figures/monocle3_trajectory_pseudotime_programmatic.jpg", p7, width=12, height=10)
Figure 7: UMAP of C. elegans embryonic cells colored by
pseudotime
Note that we could easily do this on a per-partition basis by first
grouping the cells by partition using the partitions()
function. This would result in all cells being assigned a finite
pseudotime.
πΏ Subset cells by
branch
It is often useful to subset cells based on their branch in the
trajectory. The function choose_graph_segments allows you
to do so interactively.
cds_sub <- choose_graph_segments(cds)
π§ Working with 3D
trajectories
You can also perform trajectory analysis in 3D.
cds_3d <- reduce_dimension(cds, max_components = 3)
cds_3d <- cluster_cells(cds_3d)
cds_3d <- learn_graph(cds_3d)
cds_3d <- order_cells(cds_3d, root_pr_nodes=get_earliest_principal_node(cds))
# Get the UMAP coordinates and partition information
umap_coords <- reducedDims(cds_3d)$UMAP
cell_partitions <- partitions(cds_3d)
# Create a data frame for plotting
plot_df <- data.frame(
UMAP_1 = umap_coords[,1],
UMAP_2 = umap_coords[,2],
UMAP_3 = umap_coords[,3],
Partition = factor(cell_partitions)
)
# Create and display interactive 3D plot with plotly
# Create the 3D plot
fig <- plot_ly(plot_df,
x = ~UMAP_1,
y = ~UMAP_2,
z = ~UMAP_3,
color = ~Partition,
type = "scatter3d",
mode = "markers",
marker = list(size = 3)) %>%
layout(scene = list(
xaxis = list(title = "UMAP_1"),
yaxis = list(title = "UMAP_2"),
zaxis = list(title = "UMAP_3")
))
# Display the plot directly
fig
References
Cao, J., M. Spielmann, X. Qiu, X. Huang, D. M. Ibrahim, A. J. Hill, F.
Zhang, et al. 2019.
βThe Single-Cell Transcriptional Landscape of
Mammalian Organogenesis.β Journal Article.
Nature 566
(7745): 496β502.
https://doi.org/10.1038/s41586-019-0969-x.
Packer, J. S., Q. Zhu, C. Huynh, P. Sivaramakrishnan, E. Preston, H.
Dueck, D. Stefanik, et al. 2019.
βA Lineage-Resolved Molecular
Atlas of c. Elegans Embryogenesis at Single-Cell Resolution.β
Journal Article.
Science 365 (6459).
https://doi.org/10.1126/science.aax1971.
Saelens, W., R. Cannoodt, H. Todorov, and Y. Saeys. 2019.
βA
Comparison of Single-Cell Trajectory Inference Methods.β Journal
Article.
Nat Biotechnol 37 (5): 547β54.
https://doi.org/10.1038/s41587-019-0071-9.
Trapnell, C., D. Cacchiarelli, J. Grimsby, P. Pokharel, S. Li, M. Morse,
N. J. Lennon, K. J. Livak, T. S. Mikkelsen, and J. L. Rinn. 2014.
βThe Dynamics and Regulators of Cell Fate Decisions Are Revealed
by Pseudotemporal Ordering of Single Cells.β Journal Article.
Nat Biotechnol 32 (4): 381β86.
https://doi.org/10.1038/nbt.2859.
---
title: "🏗️ Constructing single-cell trajectories with Monocle3🧜"
author: "Hongchen Xiao"
date: "`r Sys.Date()`"
bibliography: ["Bioinformatics.bib"]
output: 
  html_document:
    toc: true
    toc_float: 
      collapsed: TRUE
      smooth_scroll: true
    number_sections: true
    code_folding: show
    theme: cosmo
    highlight: tango
    df_print: paged
    fig_caption: true
    code_download: true
---


```{r setup, include=FALSE}
knitr::opts_chunk$set(
  echo = TRUE,
  warning = FALSE,
  message = FALSE,
  error = FALSE,
  comment = "#>",
  fig.align = "center",
  out.width = "80%",
  dpi = 300,
  cache = TRUE
)

```


```{r klippy, echo=FALSE, include=TRUE}
library(klippy)
    klippy::klippy(position = c('bottom', 'right'),
                   lang = c('r', 'python', 'bash'),
                   tooltip_message = 'Click to copy', 
                   tooltip_success = 'Copied!',
                   color = 'darkRed')
```


```{css, echo=FALSE}
/* Custom CSS styling */
body {
  font-family: 'Helvetica Neue', Arial, sans-serif;
  line-height: 1.8;
  max-width: 1200px;
  margin: auto;
  padding: 2em;
  background-color: #ffffff;
  color: #2c3e50;
}

h1, h2, h3, h4 {
  color: #2c3e50;
  font-weight: 600;
  margin-top: 2em;
  border-bottom: 2px solid #eaecef;
  padding-bottom: 0.3em;
}

.title {
  color: #2c3e50;
  text-align: center;
  font-size: 2.8em;
  margin-bottom: 1em;
  border-bottom: none;
  font-weight: 700;
}

.author, .date {
  text-align: center;
  color: #7f8c8d;
  margin-bottom: 2em;
}

/* Code blocks */
pre {
  background-color: #f8f9fa !important;
  border: 1px solid #e9ecef;
  border-radius: 4px;
  padding: 1em;
  margin: 1em 0;
  overflow-x: auto;
}

code {
  color: #e83e8c;
  background-color: #f8f9fa;
  padding: 2px 4px;
  border-radius: 3px;
}

pre.r:before {
  content: "R Script";
  display: block; /* Make the label appear on its own line */
  font-weight: bold;
  color: #555;
  background-color: #eee;
  padding: 5px
}

pre.bash:before {
  content: "Bash Script";
  display: block; /* Make the label appear on its own line */
  font-weight: bold;
  color: #555;
  background-color: #eee;
  padding: 5px;
}

/* Table of Contents styling */
#TOC {
  padding: 20px;
  background: #f8f9fa;
  border-radius: 5px;
  border: 1px solid #e9ecef;
}

.tocify {
  border: none;
}

.tocify .tocify-item a {
  color: #2c3e50;
}

.tocify .tocify-item.active a {
  color: #007bff;
  border-left: 3px solid #007bff;
}

/* Images */
img {
  max-width: 100%;
  height: auto;
  margin: 2em auto;
  display: block;
  box-shadow: 0 4px 6px rgba(0, 0, 0, 0.1);
  border-radius: 4px;
}

/* Lists */
ul, ol {
  padding-left: 1.5em;
  margin-bottom: 1.5em;
}

li {
  margin-bottom: 0.5em;
}

/* Blockquotes */
blockquote {
  border-left: 4px solid #007bff;
  padding-left: 1em;
  margin: 1em 0;
  color: #6c757d;
}

/* Tables */
table {
  width: 100%;
  margin: 2em 0;
  border-collapse: collapse;
}

th, td {
  padding: 12px;
  border: 1px solid #e9ecef;
}

th {
  background-color: #f8f9fa;
  font-weight: 600;
}

/* Links */
a {
  color: #007bff;
  text-decoration: none;
}

a:hover {
  color: #0056b3;
  text-decoration: underline;
}

/* Section spacing */
section {
  margin-bottom: 3em;
}

/* Code output */
.output {
  background-color: #f8f9fa;
  border: 1px solid #e9ecef;
  border-radius: 4px;
  padding: 1em;
  margin-top: 1em;
}

/* Warning and note boxes */
.note, .warning {
  padding: 1em;
  margin: 1em 0;
  border-radius: 4px;
}

.note {
  background-color: #e3f2fd;
  border: 1px solid #90caf9;
}

.warning {
  background-color: #fff3e0;
  border: 1px solid #ffb74d;
}

```

## 🔃 Load required libraries

```{r load-libraries}
library(ggplot2)
library(htmlwidgets)
library(igraph)
library(monocle3)
library(plotly)

set.seed(12345)
```

## 🧬 Constructing single-cell trajectories

During development, in response to stimuli, and throughout life, cells transition from one functional "state" to another. Cells in different states express different sets of genes, producing a dynamic repetoire of proteins and metabolites that carry out their work. As cells move between states, they undergo a process of transcriptional re-configuration, with some genes being silenced and others newly activated. These transient states are often hard to characterize because purifying cells in between more stable endpoint states can be difficult or impossible. Single-cell RNA-Seq can enable you to see these states without the need for purification. However, to do so, we must determine where each cell is in the range of possible states. For a comparison of different trajectory inference methods, see Saelens et al. (2019) [@RN30].

Monocle 3 [@RN75] is the latest version of the Monocle toolkit. Monocle introduced the strategy of using RNA-Seq for single-cell trajectory analysis [@RN76]. Rather than purifying cells into discrete states experimentally, Monocle uses an algorithm to learn the sequence of gene expression changes each cell must go through as part of a dynamic biological process. Once it has learned the overall "trajectory" of gene expression changes, Monocle can place each cell at its proper position in the trajectory. You can then use Monocle's differential analysis toolkit to find genes regulated over the course of the trajectory, as described in the section *Finding genes that change as a function of pseudotime*. If there are multiple outcomes for the process, Monocle will reconstruct a "branched" trajectory. These branches correspond to cellular "decisions", and Monocle provides powerful tools for identifying the genes affected by them and involved in making them. You can see how to analyze branches in the section *Analyzing branches in single-cell trajectories*.

The workflow for reconstructing trajectories is very similar to the workflow for clustering, but it has a few additional steps. To illustrate the workflow, we will use another *C. elegans* data set, this one from Packer & Zhu et al. [@RN77]. Their study includes a time series analysis of whole developing embyros. We will examine a small subset of the data which includes most of the neurons.

## 📥 Loading the data

We will load the data as we did with the L2 data. This dataset contains an expression matrix, cell metadata, and gene annotations.

```{r load-data}
expression_matrix <- readRDS(url("https://depts.washington.edu:/trapnell-lab/software/monocle3/celegans/data/packer_embryo_expression.rds"))
cell_metadata <- readRDS(url("https://depts.washington.edu:/trapnell-lab/software/monocle3/celegans/data/packer_embryo_colData.rds"))
gene_annotation <- readRDS(url("https://depts.washington.edu:/trapnell-lab/software/monocle3/celegans/data/packer_embryo_rowData.rds"))

cds <- new_cell_data_set(expression_matrix,
                         cell_metadata = cell_metadata,
                         gene_metadata = gene_annotation)
```

## ⚙️ Pre-process the data

Pre-processing works exactly as in clustering analysis. This time, we will use a different strategy for batch correction, which includes what Packer & Zhu et al did in their original analysis.

**Note:** Your data will not have the loading batch information demonstrated here, you will correct batch using your own batch information.

```{r preprocess-data}
cds <- preprocess_cds(cds, num_dim = 50)
cds <- align_cds(cds, alignment_group = "batch", residual_model_formula_str = "~ bg.300.loading + bg.400.loading + bg.500.1.loading + bg.500.2.loading + bg.r17.loading + bg.b01.loading + bg.b02.loading")
```

Note that in addition to using the `alignment_group` argument to `align_cds()`, which aligns groups of cells (i.e. batches), we are also using `residual_model_formula_str`. This argument is for subtracting continuous effects. You can use this to control for things like the fraction of mitochondrial reads in each cell, which is sometimes used as a QC metric for each cell. In this experiment (as in many scRNA-seq experiments), some cells spontanously lyse, releasing their mRNAs into the cell suspension immediately prior to loading into the single-cell library prep. This "supernatant RNA" contaminates each cells' transcriptome profile to a certain extent. Fortunately, it is fairly straightforward to estimate the level of background contamination in each batch of cells and subtract it, which is what Packer et al did in the original study. Each of the columns `bg.300.loading`, `bg.400.loading`, corresponds to a background signal that a cell might be contaminated with. Passing these colums as terms in the `residual_model_formula_str` tells `align_cds()` to subtract these signals prior to dimensionality reduction, clustering, and trajectory inference. Note that you can call `align_cds()` with `alignment_group`, `residual_model_formula`, or both.

## 🎨 Reduce dimensionality and visualize the results

Next, we reduce the dimensionality of the data. However, unlike clustering, which works well with both UMAP and t-SNE, here we strongly urge you to use UMAP, the default method:

```{r reduce-dimensionality}
cds <- reduce_dimension(cds)
p1 <- plot_cells(cds, label_groups_by_cluster=FALSE,  color_cells_by = "cell.type")
ggsave("figures/monocle3_trajectory_umap.pdf", p1, width=12, height=10)
ggsave("figures/monocle3_trajectory_umap.jpg", p1, width=12, height=10)
```

<div style="text-align: center;">
<img src="figures/monocle3_trajectory_umap.jpg" alt="UMAP of C. elegans embryonic cells colored by cell type" style="width: 80%;"/>
<p class="caption">**Figure 1:** UMAP of C. elegans embryonic cells colored by cell type</p>  
</div>

As you can see, despite the fact that we are only looking at a small slice of this dataset, Monocle reconstructs a trajectory with numerous branches. Overlaying the manual annotations on the UMAP reveals that these branches are principally occupied by one cell type.

As with clustering analysis, you can use `plot_cells()` to visualize how individual genes vary along the trajectory. Let's look at some genes with interesting patterns of expression in ciliated neurons:

```{r plot-ciliated-genes}
ciliated_genes <- c("che-1",
                    "hlh-17",
                    "nhr-6",
                    "dmd-6",
                    "ceh-36",
                    "ham-1")

p2 <- plot_cells(cds,
           genes=ciliated_genes,
           label_cell_groups=FALSE,
           show_trajectory_graph=FALSE)
ggsave("figures/monocle3_trajectory_ciliated_genes.pdf", p2, width=12, height=10)
ggsave("figures/monocle3_trajectory_ciliated_genes.jpg", p2, width=12, height=10)
```

<div style="text-align: center;">
<img src="figures/monocle3_trajectory_ciliated_genes.jpg" alt="Expression of ciliated neuron genes along the trajectory" style="width: 80%;"/>
<p class="caption">**Figure 2:** Expression of ciliated neuron genes along the trajectory</p>
</div>



We will learn how to identify the genes that are restricted to each outcome of the trajectory later on in the section *Finding genes that change as a function of pseudotime*.

## 🔬 Cluster your cells

Although cells may continuously transition from one state to the next with no discrete boundary between them, Monocle does not assume that all cells in the dataset descend from a common transcriptional "ancestor". In many experiments, there might in fact be multiple distinct trajectories. For example, in a tissue responding to an infection, tissue resident immune cells and stromal cells will have very different initial transcriptomes, and will respond to infection quite differently, so they should be a part of the same trajectory.

Monocle is able to learn when cells should be placed in the same trajectory as opposed to separate trajectories through its clustering procedure. Recall that we run `cluster_cells()`, each cell is assigned not only to a cluster but also to a partition. When you are learning trajectories, each partition will eventually become a separate trajectory. We run `cluster_cells()` as before.

```{r cluster-cells}
cds <- cluster_cells(cds)
p3 <- plot_cells(cds, color_cells_by = "partition")
ggsave("figures/monocle3_trajectory_partitions.pdf", p3, width=12, height=10)
ggsave("figures/monocle3_trajectory_partitions.jpg", p3, width=12, height=10)
```

<div style="text-align: center;">
<img src="figures/monocle3_trajectory_partitions.jpg" alt="UMAP of C. elegans embryonic cells colored by partition" style="width: 80%;"/>
<p class="caption">**Figure 3:** UMAP of C. elegans embryonic cells colored by partition</p>
</div>

## 🕸️ Learn the trajectory graph

Next, we will fit a principal graph within each partition using the `learn_graph()` function:

```{r learn-graph, echo=TRUE, results='hide'}
cds <- learn_graph(cds)
```
```{r plot-graph}
p4 <- plot_cells(cds,
           color_cells_by = "cell.type",
           label_groups_by_cluster=FALSE,
           label_leaves=FALSE,
           label_branch_points=FALSE)

ggsave("figures/monocle3_trajectory_graph.pdf", p4, width=12, height=10)
ggsave("figures/monocle3_trajectory_graph.jpg", p4, width=12, height=10)

```

<div style="text-align: center;">
<img src="figures/monocle3_trajectory_graph.jpg" alt="UMAP of C. elegans embryonic cells with learned trajectory graph" style="width: 80%;"/>
<p class="caption">**Figure 4:** UMAP of C. elegans embryonic cells with learned trajectory graph</p>
</div>


This graph will be used in many downstream steps, such as branch analysis and differential expression.

## ⏳ Order the cells in pseudotime

Once we've learned a graph, we are ready to order the cells according to their progress through the developmental program. Monocle measures this progress in pseudotime.

> **What is pseudotime?**
>
> Pseudotime is a measure of how much progress an individual cell has made through a process such as cell differentiation.
>
> In many biological processes, cells do not progress in perfect synchrony. In single-cell expression studies of processes such as cell differentiation, captured cells might be widely distributed in terms of progress. That is, in a population of cells captured at exactly the same time, some cells might be far along, while others might not yet even have begun the process. This asynchrony creates major problems when you want to understand the sequence of regulatory changes that occur as cells transition from one state to the next. Tracking the expression across cells captured at the same time produces a very compressed sense of a gene's kinetics, and the apparent variability of that gene's expression will be very high.
>
> By ordering each cell according to its progress along a learned trajectory, Monocle alleviates the problems that arise due to asynchrony. Instead of tracking changes in expression as a function of time, Monocle tracks changes as a function of progress along the trajectory, which we term "pseudotime". Pseudotime is an abstract unit of progress: it's simply the distance between a cell and the start of the trajectory, measured along the shortest path. The trajectory's total length is defined in terms of the total amount of transcriptional change that a cell undergoes as it moves from the starting state to the end state.

In order to place the cells in order, we need to tell Monocle where the "beginning" of the biological process is. We do so by choosing regions of the graph that we mark as "roots" of the trajectory. In time series experiments, this can usually be accomplished by finding spots in the UMAP space that are occupied by cells from early time points:

```{r plot-time-bin}
p5 <- plot_cells(cds,
           color_cells_by = "embryo.time.bin",
           label_cell_groups=FALSE,
           label_leaves=TRUE,
           label_branch_points=TRUE,
           graph_label_size=1.5)
ggsave("figures/monocle3_trajectory_time_bin.pdf", p5, width=12, height=10)
ggsave("figures/monocle3_trajectory_time_bin.jpg", p5, width=12, height=10)

```

<div style="text-align: center;">
<img src="figures/monocle3_trajectory_time_bin.jpg" alt="UMAP of C. elegans embryonic cells colored by embryo time bin" style="width: 80%;"/>
<p class="caption">**Figure 5:** UMAP of C. elegans embryonic cells colored by embryo time bin</p>
</div>

The black lines show the structure of the graph. Note that the graph is not fully connected: cells in different partitions are in distinct components of the graph. The circles with numbers in them denote special points within the graph. Each leaf, denoted by light gray circles, corresponds to a different outcome (i.e. cell fate) of the trajectory. Black circles indicate branch nodes, in which cells can travel to one of several outcomes. You can control whether or not these are shown in the plot with the `label_leaves` and `label_branch_points` arguments to `plot_cells`. Please note that numbers within the circles are provided for reference purposes only.

Now that we have a sense of where the early cells fall, we can call `order_cells()`, which will calculate where each cell falls in pseudotime. In order to do so `order_cells()` needs you to specify the root nodes of the trajectory graph. If you don't provide them as an argument, it will launch a graphical user interface for selecting one or more root nodes.

```{r define-get-root-node}
# a helper function to identify the root principal points:
get_earliest_principal_node <- function(cds, time_bin="130-170"){
  cell_ids <- which(colData(cds)[, "embryo.time.bin"] == time_bin)

  closest_vertex <-
  cds@principal_graph_aux[["UMAP"]]$pr_graph_cell_proj_closest_vertex
  closest_vertex <- as.matrix(closest_vertex[colnames(cds), ])
  root_pr_nodes <-
  igraph::V(principal_graph(cds)[["UMAP"]])$name[as.numeric(names
  (which.max(table(closest_vertex[cell_ids,]))))]

  root_pr_nodes
}
```

```{r order-cells-programmatic}
cds <- order_cells(cds, root_pr_nodes=get_earliest_principal_node(cds))
```

In the above example, we just chose one location, but you could pick as many as you want. Plotting the cells and coloring them by pseudotime shows how they were ordered:

```{r plot-pseudotime-interactive}

p6 <- plot_cells(cds,
           
           color_cells_by = "pseudotime",
           label_cell_groups=FALSE,
           label_leaves=FALSE,
           label_branch_points=FALSE,
           graph_label_size=1.5)

ggsave("figures/monocle3_trajectory_pseudotime_interactive.pdf", p6, width=12, height=10)
ggsave("figures/monocle3_trajectory_pseudotime_interactive.jpg", p6, width=12, height=10)

```

<div style="text-align: center;">
<img src="figures/monocle3_trajectory_pseudotime_interactive.jpg" alt="UMAP of C. elegans embryonic cells colored by pseudotime" style="width: 80%;"/>
<p class="caption">**Figure 6:** UMAP of C. elegans embryonic cells colored by pseudotime</p>
</div>
  

Note that some of the cells are gray. This means they have infinite pseudotime, because they were not reachable from the root nodes that were picked. In general, any cell on a partition that lacks a root node will be assigned an infinite pseudotime. In general, you should choose at least one root per partition.

It's often desirable to specify the root of the trajectory programmatically, rather than manually picking it. The function below does so by first grouping the cells according to which trajectory graph node they are nearest to. Then, it calculates what fraction of the cells at each node come from the earliest time point. Then it picks the node that is most heavily occupied by early cells and returns that as the root.

```{r get-root-node}
# a helper function to identify the root principal points:
get_earliest_principal_node <- function(cds, time_bin="130-170"){
  cell_ids <- which(colData(cds)[, "embryo.time.bin"] == time_bin)

  closest_vertex <-
  cds@principal_graph_aux[["UMAP"]]$pr_graph_cell_proj_closest_vertex
  closest_vertex <- as.matrix(closest_vertex[colnames(cds), ])
  root_pr_nodes <-
  igraph::V(principal_graph(cds)[["UMAP"]])$name[as.numeric(names
  (which.max(table(closest_vertex[cell_ids,]))))]

  root_pr_nodes
}
cds <- order_cells(cds, root_pr_nodes=get_earliest_principal_node(cds))
```

Passing the programatically selected root node to `order_cells()` via the `root_pr_nodes` argument yields:

```{r plot-pseudotime-programmatic}
p7 <- plot_cells(cds,
           color_cells_by = "pseudotime",
           label_cell_groups=FALSE,
           label_leaves=FALSE,
           label_branch_points=FALSE,
           graph_label_size=1.5)

ggsave("figures/monocle3_trajectory_pseudotime_programmatic.pdf", p7, width=12, height=10)
ggsave("figures/monocle3_trajectory_pseudotime_programmatic.jpg", p7, width=12, height=10)

```

<div style="text-align: center;">
<img src="figures/monocle3_trajectory_pseudotime_programmatic.jpg" alt="UMAP of C. elegans embryonic cells colored by pseudotime" style="width: 80%;"/>
<p class="caption">**Figure 7:** UMAP of C. elegans embryonic cells colored by pseudotime</p>
</div>


Note that we could easily do this on a per-partition basis by first grouping the cells by partition using the `partitions()` function. This would result in all cells being assigned a finite pseudotime.

## 🌿 Subset cells by branch

It is often useful to subset cells based on their branch in the trajectory. The function `choose_graph_segments` allows you to do so interactively.

```{r subset-branch, eval=FALSE}
cds_sub <- choose_graph_segments(cds)
```

## 🧊 Working with 3D trajectories

You can also perform trajectory analysis in 3D.

```{r 3d-trajectory, echo=TRUE, results='hide'}
cds_3d <- reduce_dimension(cds, max_components = 3)
cds_3d <- cluster_cells(cds_3d)
cds_3d <- learn_graph(cds_3d)
```

```{r order-3d-cells}
cds_3d <- order_cells(cds_3d, root_pr_nodes=get_earliest_principal_node(cds))

# Get the UMAP coordinates and partition information
umap_coords <- reducedDims(cds_3d)$UMAP
cell_partitions <- partitions(cds_3d)

# Create a data frame for plotting
plot_df <- data.frame(
  UMAP_1 = umap_coords[,1],
  UMAP_2 = umap_coords[,2],
  UMAP_3 = umap_coords[,3],
  Partition = factor(cell_partitions)
)

# Create and display interactive 3D plot with plotly
# Create the 3D plot
fig <- plot_ly(plot_df, 
              x = ~UMAP_1, 
              y = ~UMAP_2, 
              z = ~UMAP_3,
              color = ~Partition,
              type = "scatter3d",
              mode = "markers",
              marker = list(size = 3)) %>%
  layout(scene = list(
    xaxis = list(title = "UMAP_1"),
    yaxis = list(title = "UMAP_2"),
    zaxis = list(title = "UMAP_3")
  ))

# Display the plot directly
fig

```






## References